Recommending Movies

The MovieLens 20M dataset contains 20 million user ratings, on a scale of 0.5 to 5 stars, of about 27,000 movies. In this demo we'll build a simple recommendation system which uses this data to suggest up to 25 movies based on seed movies you say you like and dislike.

The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.



In [1]:

    
from pymldb import Connection
mldb = Connection()

Download the MovieLens 20M data

We'll start by using some command-line tools to download and decompress the data.



In [2]:

    
%%bash
mkdir -p /mldb_data/data
curl "http://public.mldb.ai/ml-20m.zip" 2>/dev/null  > /mldb_data/data/ml-20m.zip
# -n: never overwrite files that already exist, so re-running this cell does not stop at a prompt
unzip -n /mldb_data/data/ml-20m.zip -d /mldb_data/data



Archive:  /mldb_data/data/ml-20m.zip



In [3]:

    
%%bash
head /mldb_data/data/ml-20m/README.txt









    



Summary
=======

This dataset (ml-20m) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 20000263 ratings and 465564 tag applications across 27278 movies. These data were created by 138493 users between January 09, 1995 and March 31, 2015. This dataset was generated on March 31, 2015.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in four files, `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follows.

This and other GroupLens data sets are publicly available for download at <http://grouplens.org/datasets/>.



In [4]:

    
%%bash
head /mldb_data/data/ml-20m/ratings.csv









    



userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727
1,50,3.5,1112484580
1,112,3.5,1094785740
1,151,4.0,1094785734
1,223,4.0,1112485573
1,253,4.0,1112484940

Load the data into MLDB

See the Loading Data Tutorial for more details on how to get data into MLDB.

Here we load the ratings CSV with an import.text procedure, then use a transform procedure with the pivot aggregator to create a sparse matrix representation of the ratings.



In [5]:

    
%%time

# Import the raw CSV: one row per (userId, movieId, rating, timestamp)
print(mldb.put('/v1/procedures/import_mvlns', {
    "type": "import.text", 
    "params": {
        "dataFileUrl":"file:///mldb_data/data/ml-20m/ratings.csv",
        "outputDataset": "mvlns_ratings_csv",
        "runOnCreation": True
    }
}))

# Pivot into a sparse matrix: one row per user, one column per movieId
print(mldb.put('/v1/procedures/process_mvlns', {
    "type": "transform",
    "params": {
        "inputData": """
            select pivot(movieId, rating) as *
            named userId 
            from mvlns_ratings_csv
            group by userId
            """,
        "outputDataset": "mvlns_ratings",
        "runOnCreation": True
    }
}))









    



<Response [201]>
<Response [201]>
CPU times: user 18.4 ms, sys: 4.22 ms, total: 22.7 ms
Wall time: 1min 19s
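
To make the pivot step concrete, here is a minimal pure-Python sketch of the same reshaping, assuming rows laid out like the (userId, movieId, rating) columns of ratings.csv shown earlier. The helper name pivot_ratings is hypothetical and is not part of MLDB; in the notebook this work happens inside the transform procedure.

```python
from collections import defaultdict

def pivot_ratings(rows):
    """Group (user_id, movie_id, rating) triples into one sparse row per user.

    Hypothetical helper for illustration only; in the notebook the pivot
    aggregator does the equivalent work inside the SQL query.
    """
    matrix = defaultdict(dict)
    for user_id, movie_id, rating in rows:
        # Only rated (user, movie) pairs are stored, so the matrix stays sparse
        matrix[user_id][movie_id] = rating
    return dict(matrix)

# Three rows taken from the head of ratings.csv, plus one made-up row
sample = [
    (1, 2, 3.5),
    (1, 29, 3.5),
    (1, 151, 4.0),
    (2, 2, 5.0),   # made-up second user, not from the dataset
]
sparse = pivot_ratings(sample)
print(sparse[1])  # {2: 3.5, 29: 3.5, 151: 4.0}
print(sparse[2])  # {2: 5.0}
```

Movies a user never rated simply have no entry in that user's row, which is why the query output shows NaN for most cells.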

Take a peek at the dataset

We'll use the Query API. Each row is a user, each column is a movie, and the cell value is the rating the user gave the movie.



In [6]:

    
mldb.query("select * from mvlns_ratings limit 3")









    Out[6]:

_rowName   21   58   69  163  246  357  441  471  858  908  ...  3255  3260  3386  3418  3425  3654  3713  3865  3879  3897
138492      3  4.5  4.5  3.5    4  4.5    5  4.5    5    4  ...   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
138491    NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
138490      3  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...     4     4     3     4     2     4     4     1     3     4

3 rows × 250 columns

Singular Value Decomposition (SVD)

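We will create and run a Procedure of type svd.train. The training query keeps only the movies that more than 3 users have rated. The procedure creates an embedding dataset where each row is a movie and the columns represent coordinates in a 100-dimensional space, so that similar movies end up closer to each other than dissimilar movies. It also creates a function, mvlns_svd_embedder, which we'll use later to map a user's ratings into the same space.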



In [7]:

    
# Train an SVD on the movies (columns) that more than 3 users have rated
print(mldb.put('/v1/procedures/mvlns_svd', {
    "type" : "svd.train",
    "params" : {
        "trainingData" : "select COLUMN EXPR (where rowCount() > 3) from mvlns_ratings",
        "columnOutputDataset" : "mvlns_svd_embedding",  # one embedding row per movie
        "modelFileUrl": "file://models/mvlns.svd",
        "functionName": "mvlns_svd_embedder",  # embeds a row of ratings
        "runOnCreation": True
    }
}))









    



<Response [201]>
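
To build some intuition for why an SVD puts similar movies close together, here is a toy sketch on a tiny made-up ratings matrix. It is not how svd.train is implemented (the notebook does not say which algorithm MLDB uses); it assumes power iteration with deflation, treats unrated cells as 0, and uses hypothetical names throughout.

```python
import math
import random

def top_singular_triplets(matrix, k, iters=100, seed=0):
    """Approximate the top-k singular triplets (sigma, u, v) of a dense
    matrix using power iteration with deflation. Toy code for illustration."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    residual = [row[:] for row in matrix]
    triplets = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(n_cols)]
        for _ in range(iters):
            # One power-iteration step: v <- normalise(A^T (A v))
            u = [sum(a * b for a, b in zip(row, v)) for row in residual]
            v = [sum(residual[i][j] * u[i] for i in range(n_rows))
                 for j in range(n_cols)]
            norm = math.sqrt(sum(x * x for x in v))
            v = [x / norm for x in v]
        u = [sum(a * b for a, b in zip(row, v)) for row in residual]
        sigma = math.sqrt(sum(x * x for x in u))
        u = [x / sigma for x in u]
        triplets.append((sigma, u, v))
        # Deflate: remove the component just found before finding the next
        for i in range(n_rows):
            for j in range(n_cols):
                residual[i][j] -= sigma * u[i] * v[j]
    return triplets

# Made-up data: rows are users, columns are movies, 0 means "not rated"
ratings = [
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 3, 4],
    [0, 0, 4, 3],
]
triplets = top_singular_triplets(ratings, k=2)

# One embedding per movie (column): its coordinate along each singular vector
movie_embedding = [
    [sigma * v[j] for sigma, _, v in triplets]
    for j in range(len(ratings[0]))
]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Movies 0 and 1 share an audience, so they land much closer than movies 0 and 2
print(distance(movie_embedding[0], movie_embedding[1]))
print(distance(movie_embedding[0], movie_embedding[2]))
```

Here the embedding has 2 dimensions instead of 100, but the idea is the same: movies rated by the same users in similar ways get similar coordinates.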

Explore the results!

Our dataset has movieIds, but humans think in terms of movie titles, so we'll load the titles into a dataset of their own.



In [8]:

    
from ipywidgets import interact, interact_manual
from uuid import uuid4

# Import the movie titles, naming each row by its movieId
print(mldb.put('/v1/procedures/import_movies', {
    "type": "import.text", 
    "params": {
        "dataFileUrl":"file:///mldb_data/data/ml-20m/movies.csv",
        "outputDataset": "movies",
        "select": "title, movieId",
        "named": "movieId",
        "runOnCreation": True
    }
}))









    



<Response [201]>

Here is a simple search function that finds all movies (and their corresponding movieIds) whose titles contain a given string, ignoring case.



In [9]:

    
@interact
def movie_search(x = "toy story"):
    return mldb.query("select title from movies where regex_match(lower(title), '.*%s.*')" % x.strip().lower())









    






  
    
      
_rowName  title
115875    Toy Story Toons: Hawaiian Vacation (2011)
3114      Toy Story 2 (1999)
120474    Toy Story That Time Forgot (2014)
106022    Toy Story of Terror (2013)
78499     Toy Story 3 (2010)
1         Toy Story (1995)
120468    Toy Story Toons: Partysaurus Rex (2012)
115879    Toy Story Toons: Small Fry (2011)
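
If you want to see what the '.*toy story.*' pattern does without a running MLDB, the following is a rough local stand-in using Python's re module on a handful of titles from this notebook. The function local_movie_search is hypothetical, and it assumes regex_match must match the whole title (which is why the pattern is wrapped in '.*'). Unlike the cell above, which passes the search term into the pattern as-is, this sketch escapes the term so that characters such as '(' or '.' are matched literally.

```python
import re

# A few titles from the movies dataset, keyed by movieId
titles = {
    1: "Toy Story (1995)",
    3114: "Toy Story 2 (1999)",
    2355: "Bug's Life, A (1998)",
    4886: "Monsters, Inc. (2001)",
}

def local_movie_search(term, titles):
    """Return {movieId: title} for titles containing term, ignoring case.

    Hypothetical local stand-in for the movie_search query.
    """
    # re.escape makes the term match literally, even with regex metacharacters
    pattern = ".*%s.*" % re.escape(term.strip().lower())
    return {movie_id: title for movie_id, title in titles.items()
            if re.fullmatch(pattern, title.lower())}

print(local_movie_search(" Toy Story ", titles))
# {1: 'Toy Story (1995)', 3114: 'Toy Story 2 (1999)'}
```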

Now let's create a dataset to hold user preferences, and a simple function to simulate a user rating movies they like (with a 5) and movies they dislike (with a 1), based on the movie_search function above.



In [10]:

    
# Mutable dataset holding one row of ratings per simulated user
print(mldb.put("/v1/datasets/mvlns_user_prefs", {"type": "sparse.mutable"}))

# Function returning all of a user's ratings as a single row-valued column p
print(mldb.put("/v1/functions/preferences", {
    "type": "sql.query",
    "params": {
        "query": "select {*} as p from mvlns_user_prefs where rowName()=$user"
    }
}))

def save_prefs(user_id, likes, dislikes):
    # movies matching a liked term are rated 5, those matching a disliked term 1
    for rating, search_terms in zip([5,1],[likes, dislikes]):
        for x in search_terms.split(","):
            # skip search terms of 3 characters or fewer
            if len(x) > 3:
                mldb.post("/v1/datasets/mvlns_user_prefs/rows", {
                     "rowName":user_id, 
                     # each cell is [column name, value, timestamp]
                     "columns": [[str(m), rating, 0] for m in movie_search(x).index]
                })
    mldb.post("/v1/datasets/mvlns_user_prefs/commit", {})

save_prefs("janedoe", "Toy Story", "Terminator")
mldb.query("select preferences({ user: 'janedoe' })[p] as *")









    



<Response [201]>
<Response [201]>






    Out[10]:

_rowName       1     589    1240    3114    4934    6537   68791   78499  102425  106022  115875  115879  120468  120474
result         5       1       1       5       1       1       1       5       1       5       5       5       5       5
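
For reference, here is a small sketch of the JSON body that save_prefs posts for each search term. The helper build_pref_row is hypothetical; it only mirrors the dict built inline in save_prefs, using movieIds that appear in the output above.

```python
import json

def build_pref_row(user_id, movie_ids, rating, timestamp=0):
    """Build the JSON body for one POST to the dataset's /rows route.

    Hypothetical helper mirroring the dict built inline in save_prefs.
    """
    return {
        "rowName": user_id,
        # One [column name, value, timestamp] triple per rated movie
        "columns": [[str(m), rating, timestamp] for m in movie_ids],
    }

# movieIds 1 and 3114 are rated 5 in the output above, 589 and 1240 are rated 1
liked = build_pref_row("janedoe", [1, 3114], rating=5)
disliked = build_pref_row("janedoe", [589, 1240], rating=1)

print(json.dumps(liked))
# {"rowName": "janedoe", "columns": [["1", 5, 0], ["3114", 5, 0]]}
```

Both bodies use the same rowName, which is how the likes and dislikes end up as columns of the single janedoe row shown above.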

With all that done, we can now build a recommendation engine out of a simple SQL query by mapping a user's preferences into the same space as the movie embeddings (i.e. embedding the user's preferences) and looking for the nearest movies.



In [11]:

    
# Function returning the 25 movies nearest to a point in the embedding space
print(mldb.put("/v1/functions/nearest_movies", {
    "type": "embedding.neighbors",
    "params": {
        "dataset": "mvlns_svd_embedding",
        "defaultNumNeighbors": 25,
        "columnName": "embedding"
    }
}))

# Function that embeds a user's preferences and returns the distances
# to the nearest movies
print(mldb.put("/v1/functions/recommendations", {
    "type": "sql.query",
    "params": {
        "query": """
            select nearest_movies({ 
                    coords: mvlns_svd_embedder({
                        row: preferences({ user: $user })[p]
                    })[embedding] 
                })[distances] as r
        """
    }
}))









    



<Response [201]>
<Response [201]>
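
The "looking for the nearest movies" step can be pictured with a brute-force sketch like the one below. It assumes Euclidean distance and uses made-up 2-dimensional embeddings; the function name and data are hypothetical and say nothing about how embedding.neighbors is implemented inside MLDB.

```python
import heapq
import math

def nearest_neighbors(embeddings, coords, num_neighbors=25):
    """Return the num_neighbors items closest to coords as (distance, id)
    pairs, nearest first. Brute-force Euclidean search, for illustration."""
    def distance(vec):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, coords)))
    scored = ((distance(vec), item_id) for item_id, vec in embeddings.items())
    return heapq.nsmallest(num_neighbors, scored)

# Made-up 2-dimensional embeddings; the real ones have 100 dimensions
toy_embeddings = {
    "movie_a": [1.0, 0.0],
    "movie_b": [0.9, 0.1],
    "movie_c": [0.0, 1.0],
    "movie_d": [-1.0, 0.0],
}
user_coords = [1.0, 0.2]   # made-up embedding of a user's preferences

for dist, movie in nearest_neighbors(toy_embeddings, user_coords, num_neighbors=2):
    print(movie, round(dist, 3))
# movie_b 0.141
# movie_a 0.2
```

The num_neighbors default of 25 mirrors the defaultNumNeighbors parameter in the cell above.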

Here's a simple function which lets you simulate the results of liking and disliking certain movies and getting back the resulting recommendations.



In [12]:

    
def recommend(likes="Toy Story, Terminator", dislikes="Star Trek"):
    
    # here we simulate a new user saving these preferences
    user_id = str(uuid4())
    save_prefs(user_id, likes, dislikes)
    
    # we can then run an SQL query to:
    #   - retrieve recommendations
    #   - transpose and join them to movies to get titles
    #   - exclude the already-rated movies from the result
    return mldb.query("""
        select m.title 
        named m.movieId
        from 
            transpose(( select recommendations({ user: '%(user)s' }) )) as r 
            join movies as m on r.rowPathElement(2) = m.rowPathElement(0)
        where m.movieId not in (keys of preferences({ user: '%(user)s' })[p])
        order by r.result
        """ % dict(user=user_id))

recommend(likes="Toy Story, Terminator", dislikes="Star Trek")









    Out[12]:

_rowName  m.title
316       Stargate (1994)
1682      Truman Show, The (1998)
2617      Mummy, The (1999)
1222      Full Metal Jacket (1987)
2355      Bug's Life, A (1998)
6365      Matrix Reloaded, The (2003)
3527      Predator (1987)
2329      American History X (1998)
292       Outbreak (1995)
145       Bad Boys (1995)
4886      Monsters, Inc. (2001)
2000      Lethal Weapon (1987)
1784      As Good as It Gets (1997)
434       Cliffhanger (1993)
1961      Rain Man (1988)
349       Clear and Present Danger (1994)
2115      Indiana Jones and the Temple of Doom (1984)
5445      Minority Report (2002)
594       Snow White and the Seven Dwarfs (1937)
1641      Full Monty, The (1997)
2003      Gremlins (1984)
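
The post-processing that the SQL query in recommend performs (join to titles, drop already-rated movies, order by distance) can be sketched in plain Python as follows. The helper top_recommendations and the distances are made up for illustration; only the titles come from the outputs in this notebook.

```python
def top_recommendations(distances, titles, already_rated):
    """Return (movie_id, title) pairs, nearest first, skipping rated movies.

    Hypothetical helper: distances maps movie_id -> distance, titles maps
    movie_id -> title, already_rated is a set of movie_ids.
    """
    ranked = sorted(distances.items(), key=lambda pair: pair[1])
    return [
        (movie_id, titles[movie_id])
        for movie_id, _ in ranked
        # Mirrors the join on movies and the "not in" filter of the SQL query
        if movie_id in titles and movie_id not in already_rated
    ]

titles = {
    1: "Toy Story (1995)",
    316: "Stargate (1994)",
    1682: "Truman Show, The (1998)",
}
distances = {1682: 0.12, 1: 0.05, 316: 0.10}   # made-up distances
already_rated = {1}

print(top_recommendations(distances, titles, already_rated))
# [(316, 'Stargate (1994)'), (1682, 'Truman Show, The (1998)')]
```

Because rated movies are filtered out after the 25 nearest neighbours are found, the final list can be shorter than 25.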

Here's an interactive form that lets you play with this function to see if you agree with the recommendations!

NOTE: the interactive part of this demo only works if you're running this Notebook live, not if you're looking at a static copy on http://docs.mldb.ai. See the documentation for Running MLDB.



In [13]:

    
interact_manual(recommend)









    






  
    
      
_rowName  m.title
316       Stargate (1994)
1682      Truman Show, The (1998)
2617      Mummy, The (1999)
1222      Full Metal Jacket (1987)
2355      Bug's Life, A (1998)
6365      Matrix Reloaded, The (2003)
3527      Predator (1987)
2329      American History X (1998)
292       Outbreak (1995)
145       Bad Boys (1995)
4886      Monsters, Inc. (2001)
2000      Lethal Weapon (1987)
1784      As Good as It Gets (1997)
434       Cliffhanger (1993)
1961      Rain Man (1988)
349       Clear and Present Danger (1994)
2115      Indiana Jones and the Temple of Doom (1984)
5445      Minority Report (2002)
594       Snow White and the Seven Dwarfs (1937)
1641      Full Monty, The (1997)
2003      Gremlins (1984)

Where to next?

Check out the other Tutorials and Demos.


